Abstract: With the ever-growing availability of so-called complex data, especially 
on the Web, decision-support systems such as data warehouses must store and pro- 
cess data that are not only numerical or symbolic. Warehousing and analyzing such 
data requires the joint exploitation of metadata and domain-related knowledge, which 
must thereby be integrated. In this paper, we survey the types of knowledge and meta- 
data that are needed for managing complex data, discuss the issue of knowledge and 
metadata integration, and propose a CWM-compliant integration solution that we in- 
corporate into an XML complex data warehousing framework we previously designed. 
1 Introduction 

Decision-support technologies, and more particularly data warehousing [Inm02, KR02,
JLVV03], are nowadays technologically mature. Data warehouses are aimed at monitoring
and analyzing activities that are materialized by numerical measures (facts), while
symbolic data describe these facts and constitute analysis axes (dimensions). However,
in real life, many decision-support fields (customer relationship management, marketing,
competition monitoring, medicine...) need to exploit data that are not only numerical or 
symbolic. For example, computer-aided diagnosis systems might require the analysis of
various and heterogeneous data, such as patient records, medical images, biological analysis
results, and previous diagnoses stored as texts [Saa07]. We term such data complex
data [DBRA05]. Their availability is now very common, especially since the broad development
of the Web, and more recently the Web 2.0 (blogs, wikis, multimedia data sharing
sites...).

Complex data might be structured or not, and are often located in several, heterogeneous 
data sources. Specific approaches are needed to collect, integrate, manage and analyze 
them. A data warehousing solution is interesting in this context, though adaptations are
obviously necessary to take into account data complexity (measures might not be numerical,
for instance). Data volume and dating are further arguments in favor of the
warehousing approach.

In this context, metadata and domain-related knowledge are essential in the processing of 
complex data and play an important role when integrating, managing, and analyzing them. 
In this paper, we address the issue of jointly managing knowledge and metadata, in order 
to warehouse complex data and handle them, at three different levels: at the supplier level 
(data providers), to identify all input data sources and the role of source type drivers; at 
the user level (consumers), to identify all data sources for analysis and their source type 
drivers; at the manager level (administrators), to achieve good performance. 

Since data warehouses traditionally handle knowledge in the form of metadata, we
discuss the alternatives for integrating domain-related knowledge and metadata. Our position
is that knowledge should be integrated as metadata in a complex data warehouse.
On this basis, we also present an XML-based architecture framework for complex data
warehouses that expands the one we proposed in [DBRA05].

The remainder of this paper is organized as follows. In Section 2, we survey the various
kinds of knowledge and metadata that are required for managing complex data. In Section 3,
we discuss the issue of knowledge and metadata integration, justify our choice, and
present our revised architecture framework for complex data warehouses. In Section 4, we
summarize the state of the art regarding knowledge and metadata integration. We finally
conclude this paper and provide future research directions in Section 5.



2 Knowledge and Metadata Needs 
2.1 Knowledge Types 

Two types of knowledge must be taken into consideration: tacit and explicit knowledge
[NSIH02]. Tacit knowledge includes beliefs, perspectives and mental models. Explicit
knowledge is knowledge that can be expressed formally using a language, symbols, rules, 
objects or equations, and thus can be communicated to others. In data warehousing envi- 
ronments, we are particularly interested in explicit knowledge. 

Then, different kinds of questions must be considered regarding the types of knowledge 
that are needed to manage complex data warehouses. These questions determine the de- 
scription context (what), the organizational context (who, where and when), the processing 
context (how) and the motivation and business rules (why). 

Responses to the "what"-type question describe business concepts. These elements guide
the link between metadata and knowledge, while knowledge representation uses metadata
contents and structure. The "how" and "why" questions relate to each process' motivation
and the way it operates, in comparison to an existing organization. Finally, answering
the "who", "where" and "when" questions helps in connecting the first two categories
of questions to a particular organization.

Furthermore, one type of knowledge that is often forgotten is universal or background
knowledge. For example, the number of days in a month, the work schedule with working
days, and public holidays constitute background knowledge that is essential for decision
or analytical queries.
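
As a minimal sketch of how such background knowledge might be made explicit, the following Python fragment counts the working days of a month given a set of public holidays. The function name working_days, the Monday-to-Friday working week, and the sample holiday are assumptions made for illustration only; they are not part of the framework discussed in this paper.

```python
import calendar
from datetime import date

def working_days(year, month, holidays=frozenset()):
    """Count Monday-to-Friday days of a month that are not public holidays."""
    # Background knowledge: the number of days in the given month.
    _, days_in_month = calendar.monthrange(year, month)
    count = 0
    for day in range(1, days_in_month + 1):
        current = date(year, month, day)
        # weekday() returns 0 for Monday through 6 for Sunday.
        if current.weekday() < 5 and current not in holidays:
            count += 1
    return count

# Hypothetical holiday calendar (an assumption for illustration).
sample_holidays = {date(2007, 5, 1)}
print(working_days(2007, 5, sample_holidays))
```

An analytical query that normalizes a monthly measure per working day could rely on such a function instead of hard-coding calendar facts.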

We must also consider statistical knowledge, which may include descriptive statistics 
about the data warehouse contents, or hypotheses about attributes' characteristics, such 
as probabilistic laws or sampling methods. Statistical knowledge may be provided by data 
analysis or data mining, and results should be reinjected into the system. 

Technical knowledge is also very important at different phases of the data warehouse life- 
cycle. At a high level of abstraction, it is closely related to metadata. Technical knowl- 
edge includes knowledge about data sources and targets, standard and specific data types, 
database management systems (DBMSs), software and hardware platforms, technologies, 
etc. Indexing techniques available in each DBMS belong to this type of knowledge too. 

Closely related is knowledge about organizational and geographical deployment, which 
includes information about users, their needs, their attributions and their constraints in 
regard to their needs (e.g., in terms of response time, volume of processed data, result 
format, etc.). 

The last kind of knowledge we must consider relates to data warehouse administration. It
provides information about how the data warehouse is used (access statistics) and how the
data warehouse interfaces with the operational systems, i.e., what the transactional
applications and their characteristics (frequencies, response times, users...) are, and
what the major Extracting, Transforming and Loading (ETL) problems are (scheduling
to satisfy user requirements with respect to work schedules, identification of peak
periods...). The refreshment policies of the data warehouse contents are also important
here, since they dictate the rotation period of summary data, the purge period and the
determination of dormant data.
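
To illustrate how access statistics could feed a refreshment policy, here is a hedged Python sketch that flags dormant data. The function dormant_items, the 90-day idle threshold, and the sample access log are hypothetical; the paper does not prescribe any particular rule for determining dormant data.

```python
from datetime import date, timedelta

def dormant_items(last_access, today, max_idle_days):
    """Return the names of items not accessed for more than max_idle_days."""
    limit = today - timedelta(days=max_idle_days)
    # An item is considered dormant when its last access predates the limit.
    return sorted(name for name, accessed in last_access.items() if accessed < limit)

# Hypothetical access statistics kept as administration knowledge.
access_log = {
    "sales_cube": date(2007, 6, 1),
    "image_index": date(2006, 1, 15),
    "text_summaries": date(2007, 5, 20),
}
print(dormant_items(access_log, date(2007, 6, 30), max_idle_days=90))
```

The threshold itself is a piece of administration knowledge, so keeping it as a parameter rather than a constant lets the policy evolve with usage.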



2.2 Metadata Types 

We identified five transversal and complementary classifications for metadata in the literature.
In the first classification [HMT00], metadata are classified based on the data
warehouse architecture layers, as follows:

• metadata associated with data loading and transformation, which describe the source 
data and any changes operated on data; 

• metadata associated with data management, which define the data stored in the data 
warehouse; 

• metadata used by the query manager to generate an appropriate query.

The second classification [HMT00] divides metadata into:

• technical metadata that support the technical staff and contain the terms and defini- 
tion of metadata as they appear in operational databases; 



• business metadata that support business end-users who do not have any technical 
background; 

• information navigator metadata, which are tools that help users navigate through 
both the business metadata and the warehoused data. 

In the third classification [HMT00], metadata may be:

• static metadata that are used to document or browse the system; 

• dynamic metadata that can be generated and maintained at run time. A new kind of 
metadata is made of metadata that handle the mapping between systems. 

In the fourth classification [Kim05], metadata may be:

• system catalog metadata or data descriptors; 

• relationship metadata that store information about the relationships between data 
entities (primary key/foreign key relationships, generalization/specialization rela- 
tionship, aggregation relationship, inheritance relationships and any other special 
semantic relationship implying update or delete dependency); 

• content metadata formed by descriptions of the contents of stored data at an arbitrary
granularity. Content metadata may be as simple as one keyword, or as complex as
business rules, formulae or links to whole documents;

• data lineage metadata, which are lifecycle data about stored data (information about
the creation of data, subsequent updates, versioning, summarization, transformation
rules, and descriptions of migration and replication);

• technical metadata that store technical information about stored data: format, compression
or encoding algorithm used, encryption and decryption algorithms, encryption
and decryption keys, software used to create or update the data, Application
Programming Interfaces (APIs) available to access the data, etc.;

• data usage metadata or business data that are descriptions of how and for what pur- 
poses the data are to be used by users and applications; 

• system metadata that are descriptions about the overall system environment, includ- 
ing hardware, operating system and application software; 

• process metadata that describe the processes in which the applications operate, and 
any relevant output of each step of these processes. 

Finally, the fifth classification we identified [SE06] is based upon functionality categories:
infrastructure, data model, process, quality, interface and administration.

• Infrastructure metadata contain information on system components. 



• Data model metadata (also called data dictionary) include definitions of data entities 
and the relationships among them. 

• Process metadata capture information on data generation and transfer from sources 
to targets. 

• Quality metadata contain information on the actual data stored and help in assessing
data quality (e.g., factual measurements).

• Interface metadata (also called reporting metadata) support data delivery to end- 
users. 

• Finally, administration metadata include data that are necessary for administering
the data warehouse and its associated applications (security, authentication, usage
tracking...).

To conclude this section, we cite an important standardization initiative: the Common
Warehouse Metamodel (CWM [Gro03]). CWM has been established by the Object Management
Group (OMG) within the framework of its Meta-Object Facility (MOF). CWM
proposes a metamodel that can be instantiated to obtain an operational data warehouse.
Each of the metadata types we enumerated in the above classifications should be mapped
into one or several CWM components.



3 Knowledge and Metadata Integration for Complex Data Warehousing

3.1 Integrating Knowledge and Metadata 

Current data warehouse architectures are based on metadata. However, metadata are sometimes
themselves a materialization of domain-related knowledge that facilitates the management
of data warehouses and helps in achieving good performance. It is difficult for
classical architectures to manage complex data without domain-related knowledge or
background knowledge. For example, a data warehouse administrator needs some background,
domain-related knowledge in addition to metadata to select clustering or indexing
techniques.

There are three possibilities to jointly manage knowledge and metadata: coding and repre- 
senting knowledge as metadata; modelling metadata to match knowledge representation; 
managing metadata and knowledge separately. The advantages and drawbacks of each 
possibility are discussed below. 

Coding and representing knowledge as metadata presents an important advantage: we can
keep using and maintaining current architectures and techniques. However, it is necessary
to find a solution for knowledge representation, i.e., a kind of mapping between classical
knowledge representation and metadata implementation.
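
One possible form of such a mapping can be sketched as follows, assuming that a piece of explicit knowledge is a simple if-then rule and that metadata are stored as XML elements. The element and attribute names (metadata, knowledge, condition, conclusion, domain) and the sample rule are hypothetical and are not taken from CWM or from any existing schema.

```python
import xml.etree.ElementTree as ET

def rule_to_metadata(rule_id, condition, conclusion, domain):
    """Encode an if-then rule as an XML metadata element (hypothetical schema)."""
    elem = ET.Element("knowledge", {"id": rule_id, "domain": domain, "kind": "rule"})
    ET.SubElement(elem, "condition").text = condition
    ET.SubElement(elem, "conclusion").text = conclusion
    return elem

def metadata_to_rule(elem):
    """Recover the rule from its metadata representation."""
    return (elem.get("id"), elem.findtext("condition"), elem.findtext("conclusion"))

# Store the rule next to ordinary metadata in a single repository element.
repository = ET.Element("metadata")
repository.append(
    rule_to_metadata("r1", "image.modality = 'MRI'", "index_on = 'texture'", "medicine")
)
print(ET.tostring(repository, encoding="unicode"))
```

Because the rule survives a round trip through metadata_to_rule, existing metadata tooling can store and exchange it without a separate knowledge base.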



Modelling metadata to match knowledge representation borders on the domain of knowledge
warehouses [NSIH02], which requires important adaptations and new considerations
about current architectures. Some metadata cannot be converted into knowledge and
there is a risk of losing some information. Moreover, finding a knowledge representation
that can accommodate current metadata is not obvious.

As for the third possibility, i.e., managing metadata and knowledge separately, a great
change of architecture would be essential, because a structure that coordinates
and compiles metadata and knowledge contents must be devised. Instead of reducing
complexity, this solution would increase it by introducing a new element: managing
the connection between knowledge and metadata.

In conclusion, in order to build upon the assets of current data warehouse architectures, in 
particular in terms of performance, we select the first solution and explore it in this paper. 



3.2 Revised Architecture Framework for Complex Data Warehousing 
3.2.1 Global Architecture 

In [DBRA05], we have already proposed an architecture framework for complex data
warehouses. The main components of this architecture (Figure 1) are: the data warehouse
kernel, which may be either materialized as an XML warehouse, or virtual (where cubes
are computed at run time); operational databases; source type drivers that notably include
mapping specifications between the sources and XML; and finally a metadata and knowledge
base layer that includes three submodules related to three management processes.

These three processes for managing a data warehouse are: 

1. the ETL and integration process that feeds the warehouse with source data from the
operational databases (OD) by using drivers that are specific to each source type
(ST);

2. the administration and monitoring process (MD & KB) that manages metadata
and knowledge (the administrator interacts with the data warehouse through this
process);

3. the analysis and usage process that runs user queries, produces reports, builds data 
cubes, supports On-Line Analytical Processing (OLAP), etc. (result data RD). 

Each of these processes exploits and updates the metadata and the knowledge base through 
four types of flows: 

1. the external flow, which includes the ETL and integration flow and the exploitation
(analysis and usage) flow (the warehouse may thus be considered as a black box);

2. the internal flow, between the warehouse kernel and the metadata and knowledge
base layer, and between the metadata and knowledge base layer and the source type
drivers;

3. the metadata and knowledge management and maintenance flow, which acquires
new knowledge and enriches existing knowledge;

4. the reference flow, which illustrates the fact that the external flow always refers to
the metadata and knowledge base layer for integration, ETL, and analysis and usage
in general.

Figure 1: Complex Data Warehouse Architecture Framework

The symmetric aspect between "sources" and "usages" around the data warehouse core al- 
lows us to eventually re-inject results as data sources. For instance, a data mining analysis 
may discover dependencies between variables and highlight causal relationships among 
them. We do use such techniques to determine the relevance of complex data with respect 
to given analysis goals. Then, knowledge obtained by mining can be integrated into the 
metadata repository and later re-used in the definition of complex data cubes. 

3.2.2 Core Interface 

In this section, we expand the architecture framework presented in Section 3.2.1 by integrating
knowledge and metadata. Around the data warehouse core, with respect to the
external components (operational data sources, result data stream and administration and
monitoring), we define three metadata and knowledge base (MD & KB) repositories
corresponding to the three sides of the core (Figure 2). They constitute an interface functionality.




Figure 2: Interface around the core 

The first MD & KB repository (labeled (1) in Figure 2) lies at the data integration and
ETL process level, and includes:

• an ontology for modelling domain-related knowledge; 

• information about data sources and source types; 

• mappings for the extraction and transformation processes (the E and T in ETL); 

• information about the loading (the L in ETL: frequency, mode...) and cleansing 
(purge) processes; 

• a referential or metadata repository about data, materialized views, index, clusters, 
aliases, etc. 

The MD & KB repository that is labeled (3) in Figure 2 lies at the administration and
monitoring level, and references:

• an ontology for modelling domain-related knowledge; 

• deployment, hardware and software constraints; 

• an interface between the integration and ETL level and the usage level; 

• information on users and data providers; 

• data warehouse usage information (statistics, response time, availability, feedback, 
dormant data...); 

• a referential or metadata repository about data, materialized views, index, clusters, 
aliases, etc. 



Finally, the MD & KB repository that is labeled (2) in Figure 2 lies at the usage level and
completes our interface with:

• an ontology for modelling domain-related knowledge; 

• information about aggregate operators (hierarchical lattice construction [Pei03] if
necessary) and data lineage that would allow users to go back to the sources if necessary;

• query optimizer data (query reformulation and rewriting); 

• a referential or metadata repository about data, materialized views, index, clusters, 
aliases, etc. 

Note that some of the elements we have just enumerated (e.g., ontology and referential
repository) are present in more than one interface. Hence, they must be factorized at a
higher level (labeled (4) in Figure 2). Moreover, this level must include metaknowledge,
i.e., knowledge for acquiring, expressing, using, storing, retrieving knowledge, and even
creating new knowledge. The major part of this level resides within the CWM repository.
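
The factorization step can be illustrated with a short Python sketch that lifts every element appearing in more than one repository to a shared level. The repository contents are abbreviated from the three lists above; the identifiers and the function name factorize are assumptions chosen for this illustration.

```python
from collections import Counter

# Abbreviated contents of the three MD & KB repositories (identifiers are illustrative).
repositories = {
    "integration_etl": {"ontology", "sources", "mappings", "loading", "referential"},
    "usage": {"ontology", "aggregates_lineage", "query_optimizer", "referential"},
    "administration": {"ontology", "deployment", "interface", "users",
                       "usage_statistics", "referential"},
}

def factorize(repos):
    """Split repository contents into a shared level and repository-specific parts."""
    counts = Counter(item for items in repos.values() for item in items)
    # Elements present in more than one repository move to the shared level.
    shared = {item for item, n in counts.items() if n > 1}
    specific = {name: items - shared for name, items in repos.items()}
    return shared, specific

shared, specific = factorize(repositories)
print(sorted(shared))
print({name: sorted(items) for name, items in specific.items()})
```

With these inputs, the shared level holds the ontology and the referential repository, which matches the two examples given in the text.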

3.2.3 XML as a Pivot Language 

The architecture we propose requires a universal formalism so that all its components
(core, metadata, knowledge, drivers, interface, data and knowledge interchange...) can interoperate.
With its vocation for semi-structured data exchange, the eXtensible Markup
Language (XML) already offers great flexibility to represent complex data, and great
possibilities for structuring, modelling, and storing them [DBB+03]. XML indeed allows
data and their description to be stored together, either implicitly or through a schema definition.
This type of representation is particularly useful in a data warehousing environment
where such metadata are commonplace. Furthermore, many XML and MOF-related facilities,
such as the XML Metadata Interchange (XMI [Gro05]) or the Common Warehouse Metadata
Interchange (CWMI), can help in managing metadata in an XML data warehouse and
specifying source-type drivers, while ensuring CWM compliance.

CWM compliance is ensured by the CWM repository that is integrated into the data warehouse
kernel. All MD & KB modules use this repository to communicate with the data
warehouse. CWM, through its five metamodels (object, foundation, resource, analysis and
management), provides UML components (classes, associations and packages) for modelling
all the data warehouse's elements [Gro03]. Table 1 illustrates the correspondences
between the MD & KB modules in our architecture and the CWM metamodels.

Finally, the advances in XML warehousing [Pok02, HBH03, RRT05, BMCA06] render
this solution plausible in the near future, especially since XML-related metadata interchange
facilities integrate very well in data warehouses [AvM02]. Storage possibilities
are also numerous, either into relational, XML-compatible DBMSs such as Oracle, SQL
Server or DB2; or into XML-native DBMSs such as Lore, eXist or X-Hive. Furthermore,
XML query languages such as XQuery allow the formulation of analytical queries that
are intricate to express in a relational system, e.g., moving window aggregations or rollup
operations on ragged hierarchies [BCC+05]. Hence, our XML-based framework provides
an architecture that is both extensible and "stable", and that can be compliant with future
external elements (data sources, analytical techniques and usages...).

Table 1: MD & KB and CWM correspondences
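
To make the notion of a moving window aggregation concrete, the sketch below computes a moving average over facts stored as XML. It is written in Python with the standard library rather than in XQuery, and the document layout (facts, fact, month, value) and the sample values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML facts: one measure per month.
FACTS = """
<facts>
  <fact month="2007-01" value="10"/>
  <fact month="2007-02" value="20"/>
  <fact month="2007-03" value="30"/>
  <fact month="2007-04" value="40"/>
</facts>
"""

def moving_average(xml_text, window):
    """Average the measure over each run of `window` consecutive months."""
    root = ET.fromstring(xml_text.strip())
    # Order facts chronologically; ISO-style month strings sort correctly as text.
    facts = sorted(root.findall("fact"), key=lambda f: f.get("month"))
    values = [float(f.get("value")) for f in facts]
    result = []
    for i in range(window - 1, len(values)):
        chunk = values[i - window + 1 : i + 1]
        # Label each window with its last month.
        result.append((facts[i].get("month"), sum(chunk) / window))
    return result

print(moving_average(FACTS, 3))
```

Each output pair gives the last month of a window and the average over that window; windows that would start before the first fact are simply omitted.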



4 State of the Art 

Though the literature about metadata and knowledge is abundant, the issue of integrating
metadata and knowledge is scarcely addressed. In this section, we provide a quick
overview of the studies that are nonetheless related to this problem. Metadata are always
present in data warehouse architectures [Inm02]. In our particular context, some interesting
efforts aim at decentralizing the management of metadata into functional components
of data warehouses [HMT00, Kim05, SE06]. They do not address the issue of domain-related
knowledge, though.

Knowledge is indeed rarely exploited as such in data warehouse environments. However,
issues related to knowledge management in the context of heterogeneous data warehouse
environments have been addressed, by augmenting a federated warehouse with a
knowledge repository [Ker01]. The use of knowledge as a basic element for
managing metadata is also regularly discussed in [Ste07]. However, this issue is mostly
addressed by the knowledge management community, which works on knowledge warehouses
[NSIH02, WAK05], and whose focus is obviously knowledge.

Finally, a study from the field of Geographical Information Systems (GISs, which are 
premium providers of complex data) is of particular interest to us. An extension of current 
metadata schemes has indeed been proposed to include context-based and tacit information 
about semantic attributes [SL06]. These ontology-based extended metadata improve data 
selection and interoperability decisions. Though we are more particularly interested in 
explicit knowledge in our context, we can exploit this solution in our framework. 



5 Conclusion 

In this paper, we have underlined the growing need for warehousing so-called complex
data, a task that requires the management of knowledge and metadata related to these data.
We enumerated the various kinds of knowledge and metadata that must be taken into account.
On this basis, we proposed to integrate knowledge as metadata in the warehouse.
Finally, in the light of the new insights discussed in this paper, we expanded an XML-based,
CWM-compliant architecture framework for complex data warehouses that we had previously
proposed.

One immediate perspective of our work is to validate our present proposal by experimentation,
and to evaluate the impact of metadata and knowledge integration in complex
data warehouses in terms of performance. Conducting performance evaluations and comparisons,
with and without integrating knowledge and metadata, should show the
actual relevance of our solution.

A related, important follow-up of our work is to assess the consequences of metadata 
and knowledge integration on traditional performance optimization techniques such as 
view materialization, indexing, partitioning, query optimization, etc. These techniques 
will presumably need to be adapted to take into account domain-related knowledge and 
achieve the best performance. 

Finally, our position in this paper is to manage metadata and knowledge integration
by representing knowledge as metadata. Though we discussed arguments in favor of this
particular approach in Section 3.1, it would be interesting to explore and assess the efficacy
of the other possible solutions, namely representing metadata as knowledge or managing
knowledge and metadata separately.